The dataset we explored is on the monthly rental prices of the United States of America from 2010 to 2017 sourced from the application Zillow. This dataset was found on the website Kaggle and has no listed author. The contents of the dataset covers all types of properties such as apartments, townhomes, single family, and multi-family properties that make up the median city code price listing. It is important to note a few important facts about the dataset before going into any more detail. First off, the rental prices are not listing of physical properties but in fact is the median price of the specific city code. Secondly, the dataset does not include every geographical location in the United States and does not include the state of Alaska. Our dataset consists of 13131 observations. There are 1738 rows that are completely empty so after removing these rows we have 11348 observations and 81 variables. Our variables consist of the city code, which is the unique identification of each observation, city, state, metro, country, and population rank. The rest of the variables are time stamps of the monthly rental price of that row’s city code from November 2010 to January 2017. Zillow creates rent forecasting grounded on exclusive statistical and machine learning models also known as Zestimates. Zestimate history only goes back until November 2010 hence why the first time stamp begins on this date. Our dataset was last updated on January 2017 and will be the last variable of the overall dataset.
Although the dataset has 11348 substantial observations, we need to highlight limitations and biases of the dataset we have chosen to explore. As was stated above the dataset does not include every single zip code in the United States and affected certain results in our EDA. There are a plethora of websites and companies that list rental properties. A major bias is that we have chosen a dataset that limits itself to only using properties that were listed on the Zillow website, so even when talking about the changes of locations and their rental prices it is only considering the information that was listed on Zillow and used for their Zestimate.
The goal of exploring the Zillow dataset was to answer and look into three main questions that were introduced about how time and location affected rental prices. The first question is between the east and west coast what region’s price rose the most or were they similar from the first time stamp to the last time stamp? Was there ever overlapping in pricing? Secondly, what cities saw the most change in price differentiation amongst the cities that were picked? Lastly, with the total population …………………….
The first question to be discussed will be the differences and comparisons of the East coast and West coast. Each region is defined by what the United States refers to as East and West coast. The East coast is made up of fourteen states and after subsetting from the original lump data frame has 3988 observations. The West coast is comprised of five states and following the creation of its own data frame the west coast has 1221 observations. We will first look at the frequency of rental prices by viewing the histograms of each region and the month of November with the year 2010 depicted in the light red color and year 2016 shown in the light blue color. For each chart shown there will be a plot including outliers and then a chart not including outliers to help with distribution.
library(ggplot2)
colors <- c("November.2010" = "red", "November.2016" = "blue")
ggplot(westcoast_annual) +
geom_histogram(aes(November.2010, fill = "November.2010"),alpha = 0.4, bins = 40) +
geom_histogram(aes(November.2016, fill = "November.2016"),alpha = 0.4, bins = 40) +
labs(title = "Rental Price Count on the West Coast from November 2010 to November 2016 with Outliers", x = "Rental Prices",
y = "Frequency", fill = "Legend") +
scale_color_manual(values = colors)
ggplot(westcoast_annualClean) +
geom_histogram(aes(November.2010, fill = "November.2010"),alpha = 0.4, bins = 40) +
geom_histogram(aes(November.2016, fill = "November.2016"),alpha = 0.4, bins = 40) +
labs(title = "Rental Price Count on the West Coast from November 2010 to November 2016 without Outliers", x = "Rental Prices",
y = "Frequency", fill = "Legend") +
scale_color_manual(values = colors)
The most observed rental price for the West coast for both 2010 and 2016 of November sat right around 1250 a month with the bin shifting slightly to the right for the year 2016 with the number of frequencies being above 100 for 2010 and about 75 for the year 2016. Prices for this region did get more expensive and we can notice that for 2016 the plot is crawling up the x-axis showing that prices are climbing.
colors <- c("November.2010" = "red", "November.2016" = "blue")
ggplot(eastcoast_annual) +
geom_histogram(aes(November.2010, fill = "November.2010"),alpha = 0.4, bins = 40) +
geom_histogram(aes(November.2016, fill = "November.2016"),alpha = 0.4, bins = 40) +
labs(title = "Rental Price Count on the East Coast from November 2010 to November 2016", x = "Rental Prices",
y = "Frequency", fill = "Legend") +
scale_color_manual(values = colors)
ggplot(eastcoast_annualClean) +
geom_histogram(aes(November.2010, fill = "November.2010"),alpha = 0.4, bins = 40) +
geom_histogram(aes(November.2016, fill = "November.2016"),alpha = 0.4, bins = 40) +
labs(title = "Rental Price Count on the East Coast from November 2010 to November 2016", x = "Rental Prices",
y = "Frequency", fill = "Legend") +
scale_color_manual(values = colors)
For the East coast, it is clear again that about 1250 a month for the rental price is the most frequent price. Just like with the West coast we see that the blue, depicting the year 2016, is pulling itself up the x-axis again showing climbing rental prices. One thing to note about the East coast plots is the y-axis clearly being a larger number showing just the differences in observations between the two regions.
When looking at each state in the separate regions when plotted as a boxplot it is clear there is a wide variety of means and interquartile ranges. For the West coast, we see all states for year 2010 and 2016 have overlapping interquartile ranges telling us that for the West coast the prices are somewhat similar for each state. Going back to the limitations of the dataset, it’s obvious when plotted that the west coast holds two states known nationally for having higher rental prices this being Hawaii and California, and although Alaska is defined as a West coast state it was not included in the original dataset. Furthermore, the East coast region when plotted shows interquartile ranges and means that are not always overlapping with the other states in the region. This again highlights some bias of which city codes were chosen for when the original dataset was assembled.
ggplot(westcoast_annualClean, aes(State, November.2010, color=State)) + geom_boxplot()+labs(title = "West Coast States November 2010 Rent Prices without Outliers")
ggplot(westcoast_annualClean, aes(State, November.2016, color=State)) + geom_boxplot()+labs(title = "West Coast States November 2016 Rent Prices without Outliers")
ggplot(eastcoast_annualClean, aes(State, November.2010, color=State)) + geom_boxplot()+labs(title = "East Coast States November 2010 Rent Prices without Outliers")
ggplot(eastcoast_annualClean, aes(State, November.2016, color=State)) + geom_boxplot()+labs(title = "East Coast States November 2016 Rent Prices without Outliers")
Clearly after plotting the coasts as a whole region in lieu of the coast’s states that the West coast generally has higher rental prices both on the November 2010 plot and on the November 2016 plot. The average of the West coast monthly rental price on November 2010 is 1534 a month with the interquartile ranging from 1171 to 1834 a month. For the November 2016 plot, the average monthly rental price is 1823 a month with a interquartile range spanning from 1294 to 2272 a month.
ggplot(westcoast_annualClean, aes(x=November.2010)) + geom_boxplot(color="blue", outlier.shape = 9, outlier.size = 2, outlier.color = "blue") + labs(title="Entire West Coast November 2010 Boxplot") + scale_x_continuous(breaks=seq(500, 4000, 500))
ggplot(westcoast_annualClean, aes(x=November.2016)) + geom_boxplot(color="blue", outlier.shape = 9, outlier.size = 2, outlier.color = "blue") + labs(title="Entire West Coast November 2016 Boxplot") + scale_x_continuous(breaks=seq(500, 4000, 500))
It is apparent of the price difference between the two regions with the
East coast average monthly rental price on November 2010 being 1370 a
month with the interquartile range being 1082 to 1598, considerably
lower than the 2010 data from the West coast. For the month of November
in 2016 in the East coast the average monthly rent is 1498 with a
interquartile range comprising of 1170 to 1765, yet again lower than the
West’s 2016 numbers.
ggplot(eastcoast_annualClean, aes(x=November.2010)) + geom_boxplot(color="red", outlier.shape = 9, outlier.size = 2, outlier.color = "red") + labs(title="Entire East Coast November 2010 Boxplot") + scale_x_continuous(breaks=seq(500, 4000, 500))
ggplot(eastcoast_annualClean, aes(x=November.2016)) + geom_boxplot(color="red", outlier.shape = 9, outlier.size = 2, outlier.color = "red") + labs(title="Entire East Coast November 2016 Boxplot") + scale_x_continuous(breaks=seq(500, 4000, 500))
Now knowing that the West coast easily coast around 200 to 300 dollars more a month, the last question to answer and to reassure that the West coast region costs more on average is to see if there was ever overlapping in means. This being done by using t-tests on the variables November 2010 and the variable November 2016. The t-test was set at two different intervals one being at 80 and the other being at 99 percent. For the time variable of November 2010 at an 80 percent confidence interval the range spanned from 1515 to 1553 for the West coast and ranged from 1362 to 1378 for the East coast. In 2016, still at 80 percent, the West coast had a range of 1798 to 1849 and the East with a range from 1489 to 1508. When setting the confidence interval at 99 percent the West has a range of 1497 to 1571 for 2010 and 1771 to 1876 for the year 2016. The East coast had a range from 1354 to 1386 for 2010 and a span of 1480 to 1517 for 2016. The results of the t-test endorse the conclusion that the West coast had a higher average monthly rental price and never had overlapping at the 80 or 99 percent confidence interval of the t-tests with their being a differnece again of roughly 200 to 300 dollars a month.
While our dataset looked at all of the city codes in the United States, it also allows us to dig deeper. We picked multiple cities of interest in order to analyze the dataset at a more personal level. Our cities of interest are NYC (Queens) (with only data from December 2011 onwards), LA, PA, Houston (Harris), Chicago (Cook), Dallas, Las Vegas, San Fran, Detroit (Wayne), Roanoke, Richmond, District of Columbia, Baltimore, Seattle(King). We picked these cities off of multiple metrics, their population, their closeness and significance to Virginia and D.C., and their significance in respect to the rent/housing events in the time period of 2011 to 2017.
The information was gathered from the Zillow database of more than 100 million homes at the time. The data points are median terms given for each month over the time period and attributed to each city code. This can be a cause of bias when we are trying to look at the total population of rent prices over the United States, as first of all, city codes do not evenly represent the entire US. Because the data is only from Zillow, it may under or over represent some areas, or give misleading points, depending on how commonly Zillow is used in those areas. Likewise, the dataset of medians is derived from a machine learning algorithm, which can have its own errors. Because this dataset is showing the median values over time, the analysis of median vs mean as the better center of measure in the context of rent prices has most likely already been completed on the data compiler’s side. This indicates that the original data set had already calculated the measures of centers along the time frame for us.
For example, NYC ranked as the highest population city code in the United States, which comes to no surprise. Unfortunately, the data for NYC had missing values before 2012, which was one of the limitations of the data. Not all the city codes were covered equally, and there were a few areas that had a lot of missing data points. Still, the data was mostly complete, as only 1 out of the 12 specifically selected cities of interest had missing values. For New York City, this could be a point of interest for future analysis, either to try and supplement the data with more NYC, or make separate plots/analyze these cities from only 2012 and onwards to try and draw more conclusions from the dataset.
ggplot(zillow_melt, aes(x=Date, y=Prices, group=City, color=City, na.rm=TRUE)) + geom_point() + geom_line() + labs(x="Time", y="Rent Price", title="Plot of City Rent Prices over Time")
On the other hand, cities like Detroit and San Francisco were also selected for their significance at the time. Detroit went through the largest municipal bankruptcy ever seen in the United States in 2013 (Davey & Walsh, 2013). With an estimated debt between 18 and 20 billion dollars, the city had been steadily shrinking in population since the 20th century, and parts of the city became abandoned. Due to these facts, it was expected that the median would fall over time from 2011 to 2017. This was somewhat verified by the time series plots themselves, as Detroit would consistently be at the bottom of the graphs.
zillow_cmelt <- melt(zillow_change, id = "Date")
names(zillow_cmelt)[2] <- "City"
names(zillow_cmelt)[3] <- "PriceChange"
ggplot(zillow_cmelt, aes(x=Date, y=PriceChange, group=City, color=City, na.rm=TRUE)) + geom_point() + geom_line() + labs(x="Time", y="Price Change", title="Plot of City Rent Change over Time")
zillow_cumumelt <- melt(zillow_cumulative, id = "Date")
names(zillow_cumumelt)[2] <- "City"
names(zillow_cumumelt)[3] <- "CumulativePriceChange"
ggplot(zillow_cumumelt, aes(x=Date, y=CumulativePriceChange, group=City, color=City, na.rm=TRUE)) + geom_point() + geom_line() + labs(x="Time", y="Cumulative Change", title="Plot of Cumulative City Rent Change over Time")
San Francisco, on the other hand, had been dealing with a different problem. At the time, they were ranked among the cities with the highest wealth disparities, and now in 2022, they rank among the top 3. This gap in income inequality since 1999 can also be seen in the rising housing costs in the Bay Area (Karlinsky & Wang, 2021). Likewise, they were going through available public housing issues, as the new jobs opportunities available in San Francisco rapidly out-scaled the available housing due to a scarcity of new buildings. This, combined with the wealth disparity, could possibly have caused bidding wars between the wealthy surrounding the scarce housing, and elevated the prices for all. This is speculation, but from the graphs, it can be seen that San Francisco had the highest prices and proportional change out of the 12 cities by the end of the series, and the second highest cumulative proportional change, only losing out to another west coast city.
zillow_pmelt <- melt(zillow_proportion, id = "Date")
names(zillow_pmelt)[2] <- "City"
names(zillow_pmelt)[3] <- "ProportionChange"
ggplot(zillow_pmelt, aes(x=Date, y=ProportionChange, group=City, color=City, na.rm=TRUE)) + geom_point() + geom_line() + labs(x="Time", y="Proportional Change", title="Plot of City Rent Change over Time")
zillow_pcumumelt <- melt(zillow_pcumulative, id = "Date")
names(zillow_pcumumelt)[2] <- "City"
names(zillow_pcumumelt)[3] <- "CumulativeProportionChange"
ggplot(zillow_pcumumelt, aes(x=Date, y=CumulativeProportionChange, group=City, color=City, na.rm=TRUE)) + geom_point() + geom_line() + labs(x="Time", y="Cumu. Prop. Change", title="Plot of Cumulative City Rent Change over Time")
We developed our SMART questions to best use our dataset, to look at the average dollar change across the states and cities. In the same vein of thought, we thought it would be apt to take a deeper look at average change in the cities of interest, taking into account our time constraints.
Before beginning our initial EDA, our focus was to track the average change, and by taking this change as a percent proportion of the original rental prices over the months, we believed it would be the most comparable metric to look at from city code to city code. Here, we took a quick histogram of the distribution of proportional changes. The proportional changes over time look pretty normal taken together as a whole, with maybe a slight right skew, which makes sense as we would guess that rent prices should generally increase over time with inflation and what not. Remember, these changes are in percentages. On the boxplot, there are outliers on both sides, which matches what we see in the histogram. We also have our measures of variance and qqplot as well.
ggplot(zillow_pmelt, aes(x=ProportionChange, na.rm=TRUE)) + geom_histogram() + labs(x = "Proportional Change in Percentages")
ggplot(zillow_pmelt, aes(x=ProportionChange, na.rm=TRUE)) + geom_boxplot()
qqnorm(zillow_pmelt$ProportionChange)
summary(zillow_pmelt$ProportionChange)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -2.90 -0.13 0.18 0.20 0.54 2.45 13
var(zillow_pmelt$ProportionChange, na.rm=TRUE)
## [1] 0.362
sd(zillow_pmelt$ProportionChange, na.rm=TRUE)
## [1] 0.602
Finally, to check if the average proportional change of the cities were different from one another, we used an anova test with the average proportional changes against each other. Since we have a p-value less than our alpha, there were significant differences in changes, and thus, we looked at the Post-hoc Tukey HSD as well. The pairs that have significantly different average proportional changes are [Detroit-LosAngeles], [Roanoke-LosAngeles], [SanFrancisco-Chicago], [Seattle-Chicago], [Richmond-Chicago], [Detroit-Houston], [Seattle-Philadelphia], [SanFrancisco-LasVegas], [Seattle-LasVegas], [Richmond-LasVegas], [Detroit-SanFrancisco], [Roanoke-SanFrancisco], [Seattle-Detroit], [Washington-Detroit], [Richmond-Detroit], [Baltimore-Seattle], [Roanoke-Seattle], and finally, [Roanoke-Richmond]. From this ANOVA test, we found that we had to reject our null hypothesis that all average proportional changes of the city were the same.
zillow_prodrop = zillow_proportion[-2]
zillow_pdmelt <- melt(zillow_prodrop, id = "Date")
names(zillow_pdmelt)[2] <- "City"
names(zillow_pdmelt)[3] <- "ProportionChange"
ggplot(zillow_pdmelt, aes(x=City, y=ProportionChange, color=City, na.rm=TRUE)) + geom_boxplot(outlier.shape=2, outlier.size=1)
pm_anova = aov(zillow_pdmelt$ProportionChange ~ zillow_pdmelt$City)
summary(pm_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## zillow_pdmelt$City 11 31.4 2.855 8.52 2.1e-14 ***
## Residuals 876 293.6 0.335
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From our plots and tests, we can see that it is very likely that there are differences in the rental price change in different countries, and thus, opens the possibility of useful predictive modeling to project where our past information could lead us. We have answered our original SMART questions in our time frame with our limited resources, and took into account the possible biases and constraints. And from the initial plots and EDA, we can see the existence of unique cities like San Francisco and NYC that we will account for in our future models.
Davey, M., & Walsh, M. W. (2013, July 18). Billions in debt, Detroit tumbles into insolvency. The New York Times. Retrieved November 1, 2022, from https://www.nytimes.com/2013/07/19/us/detroit-files-for-bankruptcy.html
Karlinsky, S., & Wang, K. (2021, April 1). What it will really take to create an affordable Bay Area - Spur. What It Will Really Take to Create an Affordable Bay Area. Retrieved November 2, 2022, from https://www.spur.org/sites/default/files/2021-05/SPUR_What_It_Will_Really_Take_To_Create_An_Affordable_Bay_Area_Report.pdf
ggplot(zillow, aes(x=State)) + geom_bar(colour="blue", fill="purple", alpha=0.6)
num_city = 10
values=head(zillow, num_city, )
values=data.frame(t(as.matrix(values[,7:81])))
colnames(values)=zillow[1:num_city,2]
date = seq(as.Date("2010/11/01"), as.Date("2017/01/31"), "month")
date = as.yearmon(date)
ts=zoo(values,order.by = date)
values=fortify(ts)
values$Index=as.Date(values$Index)
autoplot(ts,facets = NULL)+ geom_point(size=0.5) +
theme_minimal()+
labs(x="Time",y="Price")
price = gather(data = zillow, "Month", "Price", 7:81, factor_key = T)
ggplot(price[price$State == "CA" & !is.na(price$Metro),]) +
geom_boxplot(aes(x = fct_rev(Metro),
y = as.numeric(Price)), fill="#FF9999", color="#56B4E9", outlier.size = 0.5) +
labs(x = "Metro Area", y = "Price", #note these are opposite because I use coord_flip to flip the axes
colour = "#E0E0E0") +
coord_flip()
price = gather(data = zillow, "Month", "Price", 7:81, factor_key = T)
ggplot(price[price$State == "PA" & !is.na(price$Metro),]) +
geom_boxplot(aes(x = fct_rev(Metro),
y = as.numeric(Price)), fill="#9999CC", color="#66CC99", outlier.size = 0.5) +
labs(x = "Metro Area", y = "Price", #note these are opposite because I use coord_flip to flip the axes
colour = "#E0E0E0") +
coord_flip()
# price
ggplot(price[price$County == "Dallas" & !is.na(price$City),],
aes(y = fct_rev(City), x = Price)) +
labs(title = "Zillow | Dallas County Rent Prices",
subtitle = "2010 - 2017",
x = "Price", y = "Dallas County",
colour = "Time") +
geom_point(shape = 20, alpha = 0.4, size = 3, aes(color = Month)) +
scale_color_discrete(l = 45, h = c(30, 330))
ggplot(price[price$County == "Los Angeles" & !is.na(price$City),],
aes(y = fct_rev(City), x = Price)) +
labs(title = "Zillow | Los Angeles County Rent Prices",
subtitle = "2010 - 2017",
x = "Price", y = "LA County",
colour = "Time") +
geom_point(shape = 16, alpha = 0.4, size = 3, aes(color = Month)) +
scale_colour_viridis_d()
# scale_colour_gradientn(colours=rainbow(4))
# Create variable of numeric year
price$Year = as.character(price$Month)
price$Year = as.numeric(substr(price$Month, nchar(as.character(price$Month)) - 3, nchar(as.character(price$Month))))
# Calculate range for each state, by year
states = price[!is.na(price$State),] %>%
group_by(State, Year) %>%
summarise(Mean = round(mean(Price),0),
Min = min(Price),
Max = max(Price))
# Plot change over time, by state.
ggplot(states[!is.na(states$Mean),],
aes(x = fct_rev(State),
y = as.numeric(Mean))) +
labs(title = "Zillow | Mean US Rent Prices",
subtitle = "2010 - 2017",
x = "States", y = "Average Price",
colour = "#E0E0E0") +
geom_point(shape = 20, alpha = 0.8, size = 5, aes(color = Year)) +
scale_color_continuous(aes(guide = ""), low = "black", high = "purple") +
guides(fill = guide_colourbar(barwidth = 0.7, barheight = 15)) +
coord_flip()
num_city = 5
values=head(zillow, num_city, )
values=data.frame(t(as.matrix(values[,7:81])))
colnames(values)=zillow[1:num_city,2]
date = seq(as.Date("2010/11/01"), as.Date("2017/01/31"), "month")
date = as.yearmon(date)
ts=zoo(values,order.by = date)
values=fortify(ts)
values$Index=as.Date(values$Index)
autoplot(ts,facets = NULL)+ geom_hex(size=1.5, alpha=0.7) + scale_fill_viridis_c() +
guides(fill = guide_colourbar(barwidth = 0.7, barheight = 15)) +
theme_minimal()+
labs(x="Time",y="Price")